2/11/2021 - Here I will document the process of creating simulated data to test our inter-rater reliability algorithm.
2/25/2021 - Added testing of generalized Cohen's kappa with variation of parameters: number of users, number of codes and number of texts
3/4/2021 - Added new testing of original Cohen's kappa to compare behaviour
First, I coded an algorithm to create simulated codings given some parameters:
Given that we are using non-mutually exclusive categories, each coder can apply one or more codes to each text. To choose how many codes each coder will apply, we use probabilities. The probability that a coder applies 1, 2 or 3 codes to a given text is set to [0.7, 0.2, 0.1]. This means that there is a 70% chance that a coder applies only one code, a 20% chance of two codes and a 10% chance of three codes.
Future work: Extract these probabilities from existing (real) data.
In addition, which code(s) each coder applies also depends on probabilities. At first we set equal probabilities for every code and coder. For example, with 5 codes, each code has a 20% probability of being applied to any text.
(maybe) Future work: Try with different probabilities for each code and coder.
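simulation.py itself is not reproduced here, but the generation step described above can be sketched as follows (the parameter values and structure are illustrative, not necessarily the actual ones used):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative parameters (simulation.py's actual values may differ)
n_texts, n_coders, n_codes = 10, 3, 5
count_probs = [0.7, 0.2, 0.1]         # P(coder applies 1, 2 or 3 codes)
code_probs = [1 / n_codes] * n_codes  # equal probability for every code

rows = []
for text_id in range(n_texts):
    for coder_id in range(n_coders):
        # How many codes this coder applies to this text
        n_applied = rng.choice([1, 2, 3], p=count_probs)
        # Which codes, sampled without repetition
        codes = rng.choice(n_codes, size=n_applied, replace=False, p=code_probs)
        for code_id in codes:
            rows.append({"text_id": text_id, "coder_id": coder_id,
                         "code_id": int(code_id)})

simulated = pd.DataFrame(rows)
```

Each row of the resulting dataframe is one coding, matching the (text_id, coder_id, code_id) layout of simulated_data.csv.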
The parameters were initially set to 1000 texts, 3 coders and 5 codes; we will use different parameters in future iterations. Using the algorithm simulation.py we obtained 4205 codings, where a coding is one code applied to one text by one coder. These simulated codings were saved to the file simulated_data.csv.
import pandas as pd
import plotly.express as px # plotly library
import plotly.figure_factory as ff
from scipy.stats import skewnorm
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Import simulated data to dataframe
simulated_data = pd.read_csv("simulated_data.csv")
n_codes = len(simulated_data["code_id"].unique())
n_coders = len(simulated_data["coder_id"].unique())
n_texts = len(simulated_data["text_id"].unique())
# Show first 10 codings
simulated_data.sort_values(by="text_id").head(10)
|   | text_id | coder_id | code_id |
|---|---|---|---|
| 0 | 0 | 0 | 2 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 4 |
| 4 | 0 | 1 | 2 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 2 | 0 |
| 9 | 1 | 2 | 3 |
| 8 | 1 | 1 | 0 |
| 7 | 1 | 0 | 1 |
The observed agreement of this simulated data is 0.5988.
Agreement is calculated per text and code: if more than half of the coders apply a code, or none of them does, they agree on that code for that text. Given that we are using 5 codes, each text can have from 0 to 5 agreements. We will now count agreements for each text.
agreement_by_text = []
for text_id, codings in simulated_data.groupby("text_id"):
    agreement_count = 0
    for i in range(n_codes):
        applied = len(codings[codings["code_id"] == i])
        # Agreement on a code: more than half the coders applied it, or none did
        if applied > n_coders / 2 or applied == 0:
            agreement_count += 1
    agreement_by_text.append({"text_id": text_id, "agreement_count": agreement_count})
agreement_by_text = pd.DataFrame(data = agreement_by_text, columns = ["text_id", "agreement_count"])
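The observed agreement quoted earlier follows directly from these counts: it is the fraction of (text, code) pairs on which the coders agree. A minimal toy example (invented numbers, not the simulated dataset):

```python
import pandas as pd

# Toy example with 3 texts and 5 codes (not the simulated dataset itself)
toy_counts = pd.DataFrame({"text_id": [0, 1, 2],
                           "agreement_count": [3, 4, 2]})
n_texts_toy, n_codes_toy = 3, 5

# Observed agreement = agreed (text, code) pairs / all (text, code) pairs
observed = toy_counts["agreement_count"].sum() / (n_texts_toy * n_codes_toy)
print(observed)  # 9 / 15 = 0.6
```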
Now that we have the agreement count for each text, let's visualize how it is distributed.
fig = px.histogram(agreement_by_text, x = "agreement_count", title = "Histogram of agreements per text")
fig.show()
We can observe that the data resembles a normal distribution. At this point I realized that I could try different distributions in order to obtain different observed agreements. To do this I created two algorithms: distribution_generator.py and distribution_selector.py.
The first algorithm generates multiple distributions by varying the skew and the scale of the distribution. The skew ranges from -50 to 50 in steps of 0.1, while the scale ranges from 1 to 100.
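The generator itself is not reproduced here; one way to sketch the idea (the mapping from skew-normal samples to integer agreement counts is my assumption, not necessarily how distribution_generator.py does it):

```python
import numpy as np
from scipy.stats import skewnorm

def agreement_histogram(skew, scale, n_texts=1000, n_codes=5, seed=0):
    """Sample a skew-normal distribution and bin the samples into the
    0..n_codes range to obtain one candidate agreement-count histogram."""
    samples = skewnorm.rvs(skew, scale=scale, size=n_texts, random_state=seed)
    # Rescale onto [0, n_codes] and round to integer agreement counts
    counts = np.round((samples - samples.min())
                      / (samples.max() - samples.min()) * n_codes).astype(int)
    return np.bincount(counts, minlength=n_codes + 1)

hist = agreement_histogram(skew=2, scale=10)
```

Varying `skew` and `scale` over a grid of values then yields the family of candidate distributions.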
Let's visualize how skew changes a distribution.
skew_lower = -1
skew_normal = 0
skew_upper = 1
x1 = skewnorm.rvs(skew_lower, size=10000)
x2 = skewnorm.rvs(skew_normal, size=10000)
x3 = skewnorm.rvs(skew_upper, size=10000)
hist_data = [x1, x2, x3]
group_labels = ['Skew = {}'.format(skew_lower), 'Skew = {}'.format(skew_normal), 'Skew = {}'.format(skew_upper)]
colors = ['#A56CC1', '#A6ACEC', '#63F5EF']
fig = ff.create_distplot(hist_data, group_labels, colors=px.colors.qualitative.Set2,
bin_size=.01, show_rug=False, show_hist=False)
fig.update_layout(title_text='Distributions with different skew')
fig.show()
We can observe that changing the skew shifts the distribution and slightly alters its shape. Now let's visualize how scale changes a distribution.
scale_1 = 1
scale_2 = 2
scale_3 = 3
x1 = skewnorm.rvs(0, size=10000, scale = scale_1)
x2 = skewnorm.rvs(0, size=10000, scale = scale_2)
x3 = skewnorm.rvs(0, size=10000, scale = scale_3)
hist_data = [x1, x2, x3]
group_labels = ['Scale = {}'.format(scale_1), 'Scale = {}'.format(scale_2), 'Scale = {}'.format(scale_3)]
fig = ff.create_distplot(hist_data, group_labels, colors = px.colors.qualitative.Set2,
bin_size=.1, show_rug=False, show_hist=False)
fig.update_layout(title_text='Distributions with different scale')
fig.show()
We can observe that increasing the scale widens the distribution and correspondingly lowers its peak.
Using both of these variations, distribution_generator.py creates 18,000 different distributions with agreements ranging from 0.01 to 0.99. Then distribution_selector.py selects 101 distributions, one for each agreement from 0 to 1 in steps of 0.01 (including the endpoints 0 and 1). For each agreement, it selects the distribution that is closest to a normal distribution.
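The selection criterion could look roughly like this (the "closest to normal" test is approximated here by preferring the smallest absolute skew; this is an assumption about distribution_selector.py, not its actual code):

```python
import numpy as np

def histogram_agreement(hist, n_codes=5):
    """Overall agreement implied by a histogram of per-text agreement counts."""
    return sum(i * c for i, c in enumerate(hist)) / (sum(hist) * n_codes)

def select_for_target(candidates, target, tol=0.005):
    """Among candidate (skew, histogram) pairs whose implied agreement is
    within tol of the target, keep the histogram with the smallest |skew|,
    i.e. the one closest to a plain normal distribution."""
    matches = [(abs(skew), hist) for skew, hist in candidates
               if abs(histogram_agreement(hist) - target) < tol]
    return min(matches, key=lambda m: m[0])[1] if matches else None

candidates = [(0.0, np.array([500, 0, 0, 0, 0, 500])),    # agreement 0.500
              (3.0, np.array([490, 10, 10, 0, 0, 490]))]  # agreement 0.496
selected = select_for_target(candidates, target=0.5)
```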
Let's visualize some of these distributions.
distributions = pd.read_csv("distributions/selected_distributions_1000.csv")
distributions.head()
|   | Agreement | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| 0 | 0.0000 | 1000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0198 | 944.0 | 26.0 | 19.0 | 9.0 | 2.0 | 0.0 |
| 2 | 0.0296 | 916.0 | 39.0 | 29.0 | 13.0 | 3.0 | 0.0 |
| 3 | 0.0396 | 888.0 | 52.0 | 38.0 | 18.0 | 4.0 | 0.0 |
| 4 | 0.0494 | 860.0 | 65.0 | 48.0 | 22.0 | 5.0 | 0.0 |
agreements = [0, 0.1092,0.2098, 0.3060, 0.4046,0.5128, 0.6020,0.7016, 0.8012, 0.9000,1]
hist_data = []
for agreement in agreements:
quantities = distributions[distributions["Agreement"] == agreement].values.tolist()[0][1:]
data = []
for i in range(n_codes + 1):
data += ([float(i)] * int(quantities[i]))
hist_data.append(data)
group_labels = ["Agreement = {}".format(str(i)) for i in agreements]
fig = make_subplots(rows = -(-len(agreements)//2), cols = 2, subplot_titles = (["Agreement: " + str(i) for i in agreements]))
for i in range(len(agreements)):
trace = go.Histogram(x = hist_data[i], name = "Agreement: {}".format(agreements[i]))
    fig.add_trace(trace, row=1 + i // 2, col=i % 2 + 1)
fig.update_layout(height = 1500, title_text="Distributions of different agreements",showlegend=False)
fig.update_xaxes(range=[-0.5, 5.5], tickmode = 'linear', tick0 = 1, dtick = 1)
fig.update_yaxes(range=[0, 1000])
fig.show()
In the following visualization we can see all the distributions at the same time.
agreements = distributions["Agreement"]
hist_data = {}
for agreement in agreements:
quantities = distributions[distributions["Agreement"] == agreement].values.tolist()[0][1:]
data = []
for i in range(n_codes + 1):
data += ([float(i)] * int(quantities[i]))
hist_data[agreement] = data
df = pd.DataFrame(list(hist_data.items()),columns = ['agreement','values'])
s = df.apply(lambda x: pd.Series(x['values']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'agreement_count'
df = df.drop('values', axis=1).join(s)
fig = px.histogram(df, x = "agreement_count",
animation_frame = "agreement",
color_discrete_sequence=['purple'])
fig.update_layout(title_text="Distributions of agreement 0.0",
showlegend=False,
xaxis_title_text = 'Agreement by text', # xaxis label
yaxis_title_text = "")
fig.update_xaxes(range=[-0.5, 5.5], tickmode = 'linear', tick0 = 1, dtick = 1)
fig.update_yaxes(range=[0, 1000])
for i, frame in enumerate(fig.frames):
frame.layout.title = "Distributions of agreement: {}".format(list(hist_data.keys())[i])
for step in fig.layout.sliders[0].steps:
step["args"][1]["frame"]["redraw"] = True
fig.show()
Using these 101 distributions, we run simulation.py again to generate one dataset per distribution. On each of these datasets we run generalized Cohen's kappa and obtain the observed agreement, chance agreement and kappa score.
kappa_scores = pd.read_csv("distributions/output/gck_output.csv")
kappa_scores
|   | initial_agreement | observed_agreement | chance_agreement | kappa |
|---|---|---|---|---|
| 0 | 0.0000 | 0.0000 | 0.555143 | -1.247914 |
| 1 | 0.0198 | 0.0198 | 0.555711 | -1.206221 |
| 2 | 0.0296 | 0.0296 | 0.555248 | -1.181892 |
| 3 | 0.0396 | 0.0396 | 0.555628 | -1.161252 |
| 4 | 0.0494 | 0.0494 | 0.555778 | -1.139923 |
| ... | ... | ... | ... | ... |
| 96 | 0.9604 | 0.9604 | 0.560274 | 0.909944 |
| 97 | 0.9700 | 0.9700 | 0.560193 | 0.931788 |
| 98 | 0.9802 | 0.9802 | 0.561477 | 0.954848 |
| 99 | 0.9900 | 0.9900 | 0.559857 | 0.977280 |
| 100 | 1.0000 | 1.0000 | 0.559591 | 1.000000 |
101 rows × 4 columns
With the following visualization we are going to observe how observed and chance agreement behave across different initial agreements.
fig = go.Figure()
fig.add_trace(go.Scatter(x = kappa_scores["initial_agreement"],
y=kappa_scores["observed_agreement"],
name='Observed agreement'))
fig.add_trace(go.Scatter(x = kappa_scores["initial_agreement"],
y=kappa_scores["chance_agreement"],
name='Chance agreement'))
fig.update_layout(title='Observed and chance agreement v. initial agreement',
xaxis_title='Initial agreement')
fig.show()
Now let's look at how generalized Cohen's kappa behaves given this simulated data.
fig = px.line(kappa_scores, x="initial_agreement", y="kappa")
fig.update_layout(title='Generalized Cohen\'s kappa v. initial agreement',
xaxis_title='Initial agreement', yaxis_title = "Kappa score")
fig.show()
Why do we have kappa ≈ -1.25?
Consider the Cohen's kappa equation: $k = \frac{P_o - P_e}{1 - P_e}$, with $P_o \in [0, 1]$. In the worst case $P_o = 0$, while for this data the chance agreement stays near $P_e \approx 0.5551$, so $k = \frac{0 - 0.5551}{1 - 0.5551} \approx -1.25$, which matches the minimum of the curve.
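Plugging the extreme values from the GCK output table into the formula reproduces the minimum:

```python
def kappa(p_o, p_e):
    """Cohen's kappa from observed agreement p_o and chance agreement p_e."""
    return (p_o - p_e) / (1 - p_e)

# First row of the GCK output table: P_o = 0, P_e = 0.555143
print(round(kappa(0.0, 0.555143), 4))  # -1.2479, matching the table's -1.247914
```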
We can observe a correlation between both scores, which is what we expected to see. As future work we will try different parameters and random seeds.
To test the behaviour of the algorithm we varied three parameters:
- Number of users
- Number of codes
- Number of texts
We varied the number of users from 2 to 5, keeping the number of texts at 1000 and the number of codes at 5.
Let's first look at the variation in chance agreement across these four values, since observed agreement is calculated the same way in every case.
users_2 = pd.read_csv("output/t:1000,c:5,u:2.csv")
users_3 = pd.read_csv("output/t:1000,c:5,u:3.csv")
users_4 = pd.read_csv("output/t:1000,c:5,u:4.csv")
users_5 = pd.read_csv("output/t:1000,c:5,u:5.csv")
users_2.sort_values(by= "agreement", inplace = True)
users_3.sort_values(by= "agreement", inplace = True)
users_4.sort_values(by= "agreement", inplace = True)
users_5.sort_values(by= "agreement", inplace = True)
fig = go.Figure()
fig.add_trace(go.Scatter(x = users_2["agreement"],
y=users_2["chance"],
name='U=2'))
fig.add_trace(go.Scatter(x = users_3["agreement"],
y=users_3["chance"],
name='U=3'))
fig.add_trace(go.Scatter(x = users_4["agreement"],
y=users_4["chance"],
name='U=4'))
fig.add_trace(go.Scatter(x = users_5["agreement"],
y=users_5["chance"],
name='U=5'))
fig.update_layout(title='Chance agreement v. initial agreement - Varying amount of users',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
Now let's look at the generalized Cohen's kappa score.
fig = go.Figure()
fig.add_trace(go.Scatter(x = users_2["agreement"],
y=users_2["kappa"],
name='U=2'))
fig.add_trace(go.Scatter(x = users_3["agreement"],
y=users_3["kappa"],
name='U=3'))
fig.add_trace(go.Scatter(x = users_4["agreement"],
y=users_4["kappa"],
name='U=4'))
fig.add_trace(go.Scatter(x = users_5["agreement"],
y=users_5["kappa"],
name='U=5'))
fig.update_layout(title='Generalized Cohen\'s kappa v. initial agreement - Varying amount of users',
xaxis_title='Initial agreement')
fig.show()
We varied the number of codes using the values 3, 5 and 7, maintaining the number of texts as 1000 and number of users as 3.
Let's first look at the variation in chance agreement across these three values, since observed agreement is calculated the same way in every case.
codes_3 = pd.read_csv("output/t:1000,c:3,u:3.csv")
codes_5 = pd.read_csv("output/t:1000,c:5,u:3.csv")
codes_7 = pd.read_csv("output/t:1000,c:7,u:3.csv")
codes_3.sort_values(by= "agreement", inplace = True)
codes_5.sort_values(by= "agreement", inplace = True)
codes_7.sort_values(by= "agreement", inplace = True)
fig = go.Figure()
fig.add_trace(go.Scatter(x = codes_3["agreement"],
y=codes_3["chance"],
name='C=3'))
fig.add_trace(go.Scatter(x = codes_5["agreement"],
y=codes_5["chance"],
name='C=5'))
fig.add_trace(go.Scatter(x = codes_7["agreement"],
y=codes_7["chance"],
name='C=7'))
fig.update_layout(title='Chance agreement v. initial agreement - Varying amount of codes',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
Now let's look at the generalized Cohen's kappa score.
fig = go.Figure()
fig.add_trace(go.Scatter(x = codes_3["agreement"],
y=codes_3["kappa"],
name='C=3'))
fig.add_trace(go.Scatter(x = codes_5["agreement"],
y=codes_5["kappa"],
name='C=5'))
fig.add_trace(go.Scatter(x = codes_7["agreement"],
y=codes_7["kappa"],
name='C=7'))
fig.update_layout(title='Generalized Cohen\'s kappa v. initial agreement - Varying amount of codes',
xaxis_title='Initial agreement')
fig.show()
We varied the number of texts using the values 1000, 3000 and 5000, maintaining the number of users as 3 and number of codes as 5.
Let's first look at the variation in chance agreement.
texts_1000 = pd.read_csv("output/t:1000,c:5,u:3.csv")
texts_3000 = pd.read_csv("output/t:3000,c:5,u:3.csv")
texts_5000 = pd.read_csv("output/t:5000,c:5,u:3.csv")
texts_1000.sort_values(by= "agreement", inplace = True)
texts_3000.sort_values(by= "agreement", inplace = True)
texts_5000.sort_values(by= "agreement", inplace = True)
fig = go.Figure()
fig.add_trace(go.Scatter(x = texts_1000["agreement"],
y=texts_1000["chance"],
name='T=1000'))
fig.add_trace(go.Scatter(x = texts_3000["agreement"],
y=texts_3000["chance"],
name='T=3000'))
fig.add_trace(go.Scatter(x = texts_5000["agreement"],
y=texts_5000["chance"],
name='T=5000'))
fig.update_layout(title='Chance agreement v. initial agreement - Varying amount of texts',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
Now let's look at the generalized Cohen's kappa score.
fig = go.Figure()
fig.add_trace(go.Scatter(x = texts_1000["agreement"],
y=texts_1000["kappa"],
name='T=1000'))
fig.add_trace(go.Scatter(x = texts_3000["agreement"],
y=texts_3000["kappa"],
name='T=3000'))
fig.add_trace(go.Scatter(x = texts_5000["agreement"],
y=texts_5000["kappa"],
name='T=5000'))
fig.update_layout(title='Generalized Cohen\'s kappa v. initial agreement - Varying amount of texts',
xaxis_title='Initial agreement')
fig.show()
I created new simulated datasets that fit the restrictions of the original Cohen's kappa: only two users, five mutually exclusive codes and 1000 texts.
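For reference, the two-rater, mutually exclusive computation can be sketched as follows (a minimal sketch, not the project's actual implementation):

```python
import pandas as pd

def cohens_kappa(labels_a, labels_b):
    """Original Cohen's kappa for two raters with mutually exclusive codes."""
    labels_a, labels_b = pd.Series(labels_a), pd.Series(labels_b)
    p_o = (labels_a == labels_b).mean()
    # Chance agreement: sum over codes of the product of each
    # rater's marginal code frequencies
    freq_a = labels_a.value_counts(normalize=True)
    freq_b = labels_b.value_counts(normalize=True)
    p_e = sum(freq_a.get(code, 0) * freq_b.get(code, 0)
              for code in set(labels_a) | set(labels_b))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Perfect agreement over five codes gives kappa = 1
p_o, p_e, k = cohens_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
```

Note that with five equally likely codes the expected chance agreement is $5 \times 0.2^2 = 0.2$, consistent with the near-constant chance column in the results.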
Let's look at the results.
cohen_scores = pd.read_csv("distributions/output/original_cohen.csv")
cohen_scores.head(5)
|   | agreement | observed | chance | kappa | time |
|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.199894 | -0.249834 | 1.976 |
| 1 | 0.05 | 0.05 | 0.200113 | -0.187668 | 1.001 |
| 2 | 0.10 | 0.10 | 0.200073 | -0.125103 | 1.239 |
| 3 | 0.15 | 0.15 | 0.200276 | -0.062867 | 1.069 |
| 4 | 0.20 | 0.20 | 0.200016 | -0.000020 | 1.051 |
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["observed"],
name='Observed agreement'))
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["chance"],
name='Chance agreement'))
fig.update_layout(title='Observed and chance agreement v. initial agreement',
xaxis_title='Initial agreement')
fig.show()
It's really interesting to see that chance agreement is almost constant, just as with generalized Cohen's kappa. Let's look at the Cohen's kappa score now.
fig = px.line(cohen_scores, x="agreement", y="kappa")
fig.update_layout(title='Cohen\'s kappa v. initial agreement',
xaxis_title='Initial agreement', yaxis_title = "Kappa score")
fig.show()
We can observe that it also starts from a negative value, though not as low as the generalized version. Let's compare them side by side.
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["chance"],
name='Original Cohen\'s kappa'))
fig.add_trace(go.Scatter(x = users_2["agreement"],
y=users_2["chance"],
name='GCK U=2'))
fig.add_trace(go.Scatter(x = users_3["agreement"],
y=users_3["chance"],
name='GCK U=3'))
fig.add_trace(go.Scatter(x = users_4["agreement"],
y=users_4["chance"],
name='GCK U=4'))
fig.add_trace(go.Scatter(x = users_5["agreement"],
y=users_5["chance"],
name='GCK U=5'))
fig.update_layout(title='Generalized Cohen\'s kappa v. Original Cohen\'s kappa - Chance agreement',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
Let's compare the kappa score now.
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["kappa"],
name='Original Cohen\'s kappa'))
fig.add_trace(go.Scatter(x = users_2["agreement"],
y=users_2["kappa"],
name='GCK U=2'))
fig.add_trace(go.Scatter(x = users_3["agreement"],
y=users_3["kappa"],
name='GCK U=3'))
fig.add_trace(go.Scatter(x = users_4["agreement"],
y=users_4["kappa"],
name='GCK U=4'))
fig.add_trace(go.Scatter(x = users_5["agreement"],
y=users_5["kappa"],
name='GCK U=5'))
fig.update_layout(title='Generalized Cohen\'s kappa v. Original Cohen\'s kappa - Kappa score',
xaxis_title='Initial agreement')
fig.show()
I tried changing the probabilities with which each code is applied, to see whether that is what affects the chance agreement.
cohen_scores_dp = pd.read_csv("distributions/output/original_cohen_dp.csv")
cohen_scores_dp.sort_values(by= "agreement", inplace = True)
cohen_scores_dp.head(5)
|   | agreement | observed | chance | kappa | time |
|---|---|---|---|---|---|
| 19 | 0.00 | 0.00 | 0.209054 | -0.264309 | 1.325 |
| 5 | 0.05 | 0.05 | 0.211772 | -0.205235 | 1.036 |
| 0 | 0.10 | 0.10 | 0.213424 | -0.144200 | 1.499 |
| 10 | 0.15 | 0.15 | 0.216982 | -0.085543 | 1.138 |
| 1 | 0.20 | 0.20 | 0.218269 | -0.023370 | 1.035 |
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["chance"],
name='Same probabilities'))
fig.add_trace(go.Scatter(x = cohen_scores_dp["agreement"],
y=cohen_scores_dp["chance"],
name='Different probabilities'))
fig.update_layout(title='Original Cohen\'s kappa - Same v. different probabilities',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
It seems that there is a change. What if we randomly select the probabilities for each agreement? I expect to see a different chance agreement for each initial agreement, rather than the constant values we saw before.
cohen_scores_rp = pd.read_csv("distributions/output/original_cohen_rp.csv")
cohen_scores_rp.sort_values(by= "agreement", inplace = True)
cohen_scores_rp.head(5)
|   | agreement | observed | chance | kappa | time |
|---|---|---|---|---|---|
| 19 | 0.00 | 0.00 | 0.291203 | -0.410841 | 1.016 |
| 5 | 0.05 | 0.05 | 0.298485 | -0.354212 | 1.138 |
| 0 | 0.10 | 0.10 | 0.325359 | -0.334043 | 1.068 |
| 10 | 0.15 | 0.15 | 0.232729 | -0.107822 | 1.089 |
| 1 | 0.20 | 0.20 | 0.302223 | -0.146498 | 1.076 |
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["chance"],
name='Same probabilities'))
fig.add_trace(go.Scatter(x = cohen_scores_dp["agreement"],
y=cohen_scores_dp["chance"],
name='Different probabilities'))
fig.add_trace(go.Scatter(x = cohen_scores_rp["agreement"],
y=cohen_scores_rp["chance"],
name='Random probabilities'))
fig.update_layout(title='Original Cohen\'s kappa - Same v. different v. random probabilities - Chance agreement',
xaxis_title='Initial agreement', yaxis_range=[0,1])
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x = cohen_scores["agreement"],
y=cohen_scores["kappa"],
name='Same probabilities'))
fig.add_trace(go.Scatter(x = cohen_scores_dp["agreement"],
y=cohen_scores_dp["kappa"],
name='Different probabilities'))
fig.add_trace(go.Scatter(x = cohen_scores_rp["agreement"],
y=cohen_scores_rp["kappa"],
name='Random probabilities'))
fig.update_layout(title='Original Cohen\'s kappa - Same v. different v. random probabilities - Kappa score',
xaxis_title='Initial agreement')
fig.show()